-
Notifications
You must be signed in to change notification settings - Fork 279
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Max instance lifetime #839
base: main
Are you sure you want to change the base?
Conversation
Taking a quote from #836:
In v4 of the stack and earlier there was a lifecycle hook to help make instance shutdown more graceful and give the agents time to finish jobs. It was essential in those versions, as we used AWS driven autoscaling, and any instance (even those running jobs) could be marked for termination if the ASG decided it wanted to scale down. In v5 we fully switched to the current method of scale down, where the agent self terminates the instance after an idle period. We removed the lifecycle hooks, with the understanding that they weren't needed any more. It's possible that we may want to add them back though, as there are still scenarios where AWS can decide to terminate an instance and the instance could be running job(s). One possibility is AZ rebalancing (which we worked a round in #751). Another could be introducing instance lifetimes. Without lifecycle hooks, I think this PR is highly likely to interrupt running jobs. Sorry! I think it's a neat feature though, and think we have customers that would use it. |
I would love to use this at Robinhood. Right now I have to set |
@nitrocode there's a merge conflict, would you mind updating your branch? |
@keithduncan @yob anything blocking this? Just curious if this needs changes or is good to ship.
Couldn't this still be added? The default behavior would be the same, and if a user enabled this, they are doing so at their own risk... |
Better support for ASG initiated instance termination is the blocker I’m afraid.
They could! And they are the approach I’m expecting us to use for this. Unfortunately I’m not comfortable merging this without a reliable way to handle ASG initiated instance termination. The absence of this today already causes strife so I’m very hesitant to add another source of it. For those who are comfortable with the risk, it’s possible to apply a forked copy of this template to your stacks. I’d also be thrilled to collaborate on a Lambda + SSM based lifecycle hooks implementation for the ASG events 😄 I believe it could solve several problems at once (e.g. #927 #838) if proven to be reliable, and enable features like agent instance quarantine/cordoning. |
@@ -1054,6 +1054,8 @@ Resources: | |||
- OldestLaunchConfiguration | |||
- ClosestToNextInstanceHour | |||
MaxInstanceLifetime: !If [UseMaxInstanceLifetime, !Ref MaxInstanceLifetime, !Ref "AWS::NoValue"] | |||
NewInstancesProtectedFromScaleIn: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nitrocode I suspect (but haven’t tested) adding this by merging master will break the MaxInstanceLifetime
functionality because none of the instances are eligible for ASG initiated termination. I’m not sure if you or anyone else are running stacks from this branch’s template but wanted to call it out in case you are 😄
I'm not sure why #964 was closed, but we'd like to see this feature merged |
Related to PR #836
If this number is set to 86400 or greater, the instance will terminated.